Language and variety verification on broadcast news for Portuguese

Authors

  • Jean-Luc Rouas
  • Isabel Trancoso
  • Céu Viana
  • Mónica Abreu
Abstract

This paper describes a language/accent verification system for Portuguese that explores different types of properties: acoustic, phonotactic and prosodic. The two-stage system is designed to be used as a pre-processing module for the Portuguese Automatic Speech Recognition (ASR) system developed at INESC-ID. As the ASR system is applied every day to transcribe the evening news from a Portuguese public TV channel, the presence of other languages (mainly English) and other varieties of Portuguese is very likely. In the first stage, for each automatically detected speaker, the system verifies whether the spoken language is Portuguese, as opposed to nine other languages: English, Belgian Dutch, Croatian, Czech, Galician, Greek, Hungarian, Slovene and Slovak. Speakers identified as Portuguese are then passed to the second stage, which aims to identify the Portuguese variety: European, Brazilian or African Portuguese from five countries. The identification results are then used to mark the speech data as untranscribable, or to forward it either to the European Portuguese ASR system or to a system tuned for other languages or varieties. The language verification system achieved an equal error rate of 2.5% for European Portuguese. For variety identification, the overall rate of correct identification was 83.9% when considering only the three broad varieties. The best results were obtained for Brazilian Portuguese, which was also the variety that proved easiest to identify in perceptual experiments. The identification rate among the African varieties themselves is relatively low, a fact that was also observed in the perceptual experiments.
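The two-stage routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the score fields, the 0.5 threshold, the "higher score means more likely" convention and the routing labels are all hypothetical, chosen only to show how a verification stage can gate a variety-identification stage.

```python
from dataclasses import dataclass
from typing import Dict

# The three broad varieties named in the abstract (label strings are hypothetical).
VARIETIES = ("european", "brazilian", "african")


@dataclass
class SpeakerScores:
    # Stage 1 score; assumed here that higher means "more likely Portuguese".
    portuguese_score: float
    # Stage 2 scores, one per broad variety (hypothetical representation).
    variety_scores: Dict[str, float]


def route(scores: SpeakerScores, threshold: float = 0.5) -> str:
    """Return a routing decision for one automatically detected speaker."""
    # Stage 1: language verification against an assumed decision threshold.
    if scores.portuguese_score < threshold:
        return "untranscribable_or_other_language"
    # Stage 2: pick the variety with the highest score; missing scores rank last.
    variety = max(
        VARIETIES,
        key=lambda v: scores.variety_scores.get(v, float("-inf")),
    )
    if variety == "european":
        return "european_portuguese_asr"
    return f"other_variety:{variety}"
```

In this sketch, only speakers that pass the first stage ever reach the variety decision, which mirrors the pre-processing role the abstract assigns to the system.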

Similar resources

Exploiting variety-dependent phones in Portuguese variety identification applied to broadcast news transcription

This paper presents a Variety IDentification (VID) approach and its application to broadcast news transcription for Portuguese. The phonotactic VID system, based on Phone Recognition and Language Modelling, focuses on a single tokenizer that combines distinctive knowledge about differences between the target varieties. This knowledge is introduced into a Multi-Layer Perceptron phone recognizer ...

Statistical Machine Translation of Broadcast News from Spanish to Portuguese

In this paper we describe the work carried out to develop an automatic system for translation of broadcast news from Spanish to Portuguese. Two challenging topics of speech and language processing were involved: Automatic Speech Recognition (ASR) of the Spanish News and Statistical Machine Translation (SMT) of the results to the Portuguese language. ASR of broadcast news is based on the AUDIMUS...

Automatic Speech Recognition and Identification of African Portuguese

This document deals with speech recognition of different Portuguese varieties; it summarizes results from the author’s diploma thesis [9]. The performance of a hybrid large vocabulary continuous speech recognizer, which combines multi-layer perceptrons and Hidden Markov Models, degrades heavily in the presence of African Portuguese varieties in broadcast news. Variety-specific acoustic and languag...

Automatic Recovery of Punctuation Marks and Capitalization Information for Iberian Languages

This paper shows experimental results concerning automatic enrichment of the speech recognition output with punctuation marks and capitalization information. The two tasks are treated as two classification problems, using a maximum entropy modeling approach. The approach is language independent as reinforced by experiments performed on Portuguese and Spanish Broadcast News corpora. The discrimi...

Porting a European Portuguese broadcast news recognition system to Brazilian Portuguese

This paper reports on recent work in the context of the activities of the PoSTPort project aimed at porting a Broadcast News recognition system originally developed for European Portuguese to other varieties. Concretely, in this paper we have focused on porting to Brazilian Portuguese. The impact of some of the main sources of variability has been assessed, besides proposing solutions at the le...

Journal:
  • Speech Communication

Volume 50

Publication date: 2008